Introduction to Web Scraping and Data Management for Social Scientists

Day 1: Introduction

Johannes B. Gruber

2023-07-24

Introduction

The General Plan for This Course

  • We want to teach you web scraping and data management
  • We also want to give you the tools for reproducible and transparent open science research

The Plan for Today

In this session, you will learn how to use the tools of the hunt. We will:

  • discuss some useful tools and learn:
    • how to use the terminal (and why you should).
    • how to use git version control with GitHub (you will get the material for the course in this step).
    • how to use R and Python together
  • go over some principles of using the programming language R:
    • R Refresher
    • literate programming

Woody Kelly via unsplash.com

Who am I?

  • PostDoc at Department of Communication Science at Vrije Universiteit Amsterdam and University of Amsterdam
  • Interested in:
    • Computational Social Science
    • Automated Text Analysis
    • Hybrid Media Systems and Information Flows
    • Protest and Democracy
  • Experience:
    • R user for 8 years
    • R package developer for 6 years
    • Worked on several packages for text analysis, API access and web scraping (quanteda.textmodels, LexisNexisTools, paperboy, traktok, amcat4-r, and more)

Who are you?

  • What is your name?
  • What are your research interests?
  • What is your experience with:
    • R
    • HTML
    • webscraping
  • Why are you taking this course?
  • Do you have specific plans that include webscraping?
  • What operating system are you using?

How to use the terminal (and why you should)

Why use the terminal?

While graphical user interfaces (GUIs) can be more visually intuitive and user-friendly, the command-line interface (terminal) has several advantages:

  1. Reproducibility: You use R because you don’t want to memorise the same 15 clicks in Excel and repeat them again and again. Likewise, you can save the terminal commands for things you do regularly.
  2. Scripting and Automation: The next step can be to write entire scripts or functions to automate some tasks (like shell scripts in Unix/Linux or batch files in Windows).
  3. Remote Access and Administration: Many remote servers do not come with a GUI as mirroring a desktop interface would be bandwidth-hungry. Tools like SSH (Secure Shell) can provide secure command-line access to remote systems.
  4. Extensive Tooling: Many powerful tools and utilities are command-line based, especially in fields like data science and system administration.
  5. Efficiency and Speed: Command-line interfaces can be faster to work with once you know the commands. This is particularly useful for repetitive tasks, which can be executed with simple commands.
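
To illustrate points 1 and 2, a task you repeat for every project can be saved as a small shell function. This is only a sketch: the function name new_project and the folder names are made up for this example and are not part of the course material.

```shell
# Hypothetical helper: create the same folder layout for every new project
new_project() {
  project_dir="$1"
  mkdir -p "$project_dir/data" "$project_dir/code" "$project_dir/output"
  echo "# $(basename "$project_dir")" > "$project_dir/README.md"
}

# Try it out in a temporary directory so nothing important is touched
workspace=$(mktemp -d)
new_project "$workspace/my-scraping-project"
ls "$workspace/my-scraping-project"
```

Once such a function lives in a script file, setting up the next project is a single command instead of a series of clicks.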

Some use cases

Package manager

  • Similar to the Play or App Store
  • Make it easy to install apps (without accidentally signing up for some service)
  • Much easier to update multiple packages at once
  • Available for different operating systems:
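
Which package manager you have depends on your system. The sketch below checks whether a few common ones are installed; the list of names is an assumption for illustration and is neither exhaustive nor taken from the slides.

```shell
# Print every package manager from an (assumed, non-exhaustive) list
# that can be found on this system
find_package_managers() {
  for pm in apt-get dnf pacman zypper brew choco winget scoop; do
    if command -v "$pm" >/dev/null 2>&1; then
      echo "found: $pm"
    fi
  done
}

find_package_managers
```

If nothing is printed, none of the listed tools is on your PATH and you may need to install a package manager first.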

Remote Access

Exercises 1

  1. Install git (and if you are on Windows unxutils) through a package manager
  2. Look around what other software you might want to install

Git some Version Control

What is git and why should you use it

Git is a version control system (VCS) that helps keep track of changes made to files and directories in a project. It allows you to revert to previous versions, compare changes over time, and collaborate with others on the same project without overwriting each other’s work.

  • Version control: Keep a history of your project, including every change made.
  • Collaboration: Allows multiple people to work on a project at the same time.
  • Revertibility: Made a mistake? You can always revert to a previous version.
  • Branching and merging: Work on new things without affecting the main project, then combine them when you’re ready.

How to keep track of your work (setting up a repo)

You can use git to track your work by setting up a repository, often called a “repo”. Here’s how:

  1. Navigate to your project’s directory using the terminal (or open the terminal there).
  2. Initialize a new git repository by running the command git init. This creates a new subdirectory named .git that contains all necessary git metadata.
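
The two steps above can be sketched as follows. A temporary folder stands in for your project directory here, so you can try the commands without touching real work.

```shell
# A temporary folder stands in for your project directory
project=$(mktemp -d)
cd "$project"

# Turn the folder into a git repository
git init --quiet

# The hidden .git folder now holds all of git's metadata
ls -a
```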

How to commit

Committing is the process by which you save changes to the repository.

  1. To tell git to start tracking changes in specific files, you need to add them to the repository with git add filename. If you want to add all files in the directory, you can use git add . (note the dot at the end of the command).
  2. After adding the files, you can commit your changes using git commit -m "Your commit message". The message should be a brief description of the changes made.
  3. You can check the status of your repository (which files have changes, which changes are staged for commit, etc.) using git status.
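
A minimal sketch of this cycle in a throwaway repository. The file name scrape.R and the identity set with git config are placeholders for this example.

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet
# git needs to know who you are; name and e-mail are placeholders
git config user.name "Example User"
git config user.email "user@example.com"

echo "library(rvest)" > scrape.R    # example file, made up for this sketch
git status --short                  # scrape.R is listed as untracked (??)
git add scrape.R                    # stage the file
git commit --quiet -m "Add scraping script"
git status --short                  # no output: nothing left to commit
git log --oneline                   # one commit in the history
```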

How to travel back and forth in time

If you make a mistake, or simply want to go back to an earlier version of your project, you can use git checkout [commit hash].

  1. Check the history of your git repository with git log
  2. Copy the hash of the commit you want to go back to
  3. Then, you can use git checkout [commit hash] to go back to an earlier status of your repository
  4. Return to the most recent commit with git checkout master (or git checkout main if that is the name of your default branch)
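
The following sketch walks through these steps in a throwaway repository with two commits. The file name and identity are placeholders, and instead of copying the hash by hand from git log, the hash of the first commit is looked up with git rev-list.

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
# Name of the default branch: "master" or "main", depending on your setup
default_branch=$(git symbolic-ref --short HEAD)

echo "version 1" > notes.txt
git add notes.txt
git commit --quiet -m "First version"
echo "version 2" > notes.txt
git commit --quiet -am "Second version"

git log --oneline                                   # shows both commits
first_commit=$(git rev-list --max-parents=0 HEAD)   # hash of the very first commit

git checkout --quiet "$first_commit"                # travel back in time
past_content=$(cat notes.txt)
git checkout --quiet "$default_branch"              # return to the present
present_content=$(cat notes.txt)
echo "then: $past_content / now: $present_content"
```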

How to revert

If you want to undo the changes made in one commit, use git revert.

  1. Check the history of your git repository with git log
  2. Copy the hash of the commit you want to undo
  3. Then, you can use git revert --no-commit [commit hash] to undo the changes made in that commit (--no-commit stages the reversal without committing it yet)
  4. Commit the reversal with git commit -m "Your commit message"
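
As a sketch, here is a throwaway repository in which the second commit is a mistake that gets reverted. File name and identity are placeholders for this example.

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo "good line" > notes.txt
git add notes.txt
git commit --quiet -m "Add good line"
echo "mistake" >> notes.txt
git commit --quiet -am "Add a mistake"

bad_commit=$(git rev-parse HEAD)           # hash of the commit to undo
git revert --no-commit "$bad_commit"       # stage the reversal without committing
git commit --quiet -m "Revert the mistake"

cat notes.txt                              # only "good line" is left
git log --oneline                          # three commits in the history
```

Unlike git checkout, git revert does not move you to an old state. It adds a new commit that undoes the selected one, so the history stays intact.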

How to branch

Branching in git allows you to create a separate version of your project to develop and test new features without affecting the main branch.

  1. To create a new branch, use git branch branch-name.
  2. To switch to your newly created branch, use git checkout branch-name.
  3. You can make changes here
  4. You can go back to the main branch with git checkout master (or git checkout main if that is the name of your default branch)
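
A sketch of these four steps in a throwaway repository. The branch name new-feature, the file name and the identity are placeholders.

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
default_branch=$(git symbolic-ref --short HEAD)   # "master" or "main"

echo "stable code" > analysis.R
git add analysis.R
git commit --quiet -m "Initial commit"

git branch new-feature                     # 1. create the branch
git checkout --quiet new-feature           # 2. switch to it
echo "experimental code" >> analysis.R     # 3. make changes
git commit --quiet -am "Try something new"

git checkout --quiet "$default_branch"     # 4. back on the main branch
cat analysis.R                             # still only "stable code"
```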

How to rebase

Rebasing is a way to integrate changes from one branch into another by replaying your commits on top of the other branch. This is useful when you want to merge a branch into the main branch, but the main branch has received changes in the meantime that you want to integrate first.

  1. First, switch to the branch you want to update with git checkout branch-name.
  2. Then, use git rebase other-branch-name to integrate changes from other-branch-name.
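
A sketch of a rebase in a throwaway repository: the main branch receives an update after new-feature was created, and the rebase replays the feature commit on top of that update. All names are placeholders; git checkout -b is a shortcut that creates a branch and switches to it in one step.

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
default_branch=$(git symbolic-ref --short HEAD)   # "master" or "main"

echo "base" > base.txt
git add base.txt
git commit --quiet -m "Base"

git checkout --quiet -b new-feature        # create and switch in one step
echo "feature" > feature.txt
git add feature.txt
git commit --quiet -m "Feature work"

git checkout --quiet "$default_branch"     # meanwhile, the main branch moves on
echo "update" > update.txt
git add update.txt
git commit --quiet -m "Update on the main branch"

git checkout --quiet new-feature           # 1. switch to the branch to update
git rebase --quiet "$default_branch"       # 2. replay its commits on the main branch
ls                                         # base.txt, feature.txt and update.txt
git log --oneline                          # linear history, feature commit on top
```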

How to merge

Merging is the process of integrating changes from one branch into another.

  1. First, you switch to the branch you want to merge changes into with git checkout branch-name.
  2. Then, you can merge another branch into the current one with git merge other-branch-name.
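
The same situation as a sketch with a merge instead of a rebase: both branches have received a commit, and merging creates a merge commit that joins them. Names are placeholders; the --no-edit flag accepts the default merge message so that no editor opens.

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
default_branch=$(git symbolic-ref --short HEAD)   # "master" or "main"

echo "base" > base.txt
git add base.txt
git commit --quiet -m "Base"

git checkout --quiet -b new-feature        # create and switch in one step
echo "feature" > feature.txt
git add feature.txt
git commit --quiet -m "Feature work"

git checkout --quiet "$default_branch"     # 1. switch to the branch to merge into
echo "update" > update.txt
git add update.txt
git commit --quiet -m "Update on the main branch"

git merge --quiet --no-edit new-feature    # 2. merge the other branch
ls                                         # base.txt, feature.txt and update.txt
git log --oneline --graph                  # the graph shows where the branches join
```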

What is a fork

A fork is a copy of a repository that allows you to freely experiment with changes without affecting the original project. Forking is commonly used in open source projects to propose changes to someone else’s project, or to use someone else’s project as a starting point for your own work.

How to make a pull request

A pull request is a way to propose changes from your fork or branch to the original repository. It’s how you contribute to open source projects on platforms like GitHub.

  1. First, you fork the repository.
  2. You clone your fork to a local location using e.g., git clone https://github.com/JBGruber/ess-web-scraping.git (with your own GitHub user name in place of JBGruber, so that you clone the fork rather than the original)
  3. You make changes to the local repository
  4. You add and commit the changes
  5. You use git push to upload the changes to your fork
  6. Then, on GitHub (or the similar platform where the original repository is hosted) there should be a button to open a pull request
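
Steps 2 to 5 can be tried out without a GitHub account. In the sketch below, a bare repository on disk stands in for your fork; this stand-in, the file name and the identity are assumptions for illustration. With a real fork you would clone its https://github.com/... URL instead.

```shell
workspace=$(mktemp -d)
cd "$workspace"

# Stand-in for your fork on GitHub: an empty bare repository on disk
git init --quiet --bare fork.git

git clone --quiet fork.git local-copy      # step 2: clone the fork
cd local-copy
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo "name,interests" > participants_example.csv   # step 3: make a change
git add participants_example.csv                   # step 4: add and commit
git commit --quiet -m "Add my details"
git push --quiet origin HEAD                       # step 5: upload to the fork
```

Step 6, opening the pull request itself, happens on the website of the hosting platform and is not a git command.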

Exercises 2

  1. Fork the course repository on GitHub
  2. Clone the course repository to your computer
  3. In the folder “participants” copy the file “participants.csv” and name the copy “participants_YOUR_NAME”
  4. Fill in your details (if you feel uncomfortable sharing your details with the class and on the public GitHub site, just add “-” to some or all columns)
  5. Commit and push your changes
  6. Make a pull request to the main repo
  7. Optional: After I have merged your pull request, you can set your fork to private and add me and Marius as collaborators (you will submit your homework that way)

R Refresher

Packages

  • R organises its functions in packages (even base functions)
  • Most packages must be installed (once) and attached (every new session)
install.packages("tidyverse")
library(tidyverse)

Accessing Functions

If you do not want to attach an entire package, you can use the double colon operator (::) to call a specific function:

dplyr::select(iris, Sepal.Length)
    Sepal.Length
1            5.1
2            4.9
3            4.7
4            4.6
5            5.0
6            5.4
7            4.6
8            5.0
9            4.4
10           4.9
11           5.4
12           4.8
13           4.8
14           4.3
15           5.8
16           5.7
17           5.4
18           5.1
19           5.7
20           5.1
21           5.4
22           5.1
23           4.6
24           5.1
25           4.8
26           5.0
27           5.0
28           5.2
29           5.2
30           4.7
31           4.8
32           5.4
33           5.2
34           5.5
35           4.9
36           5.0
37           5.5
38           4.9
39           4.4
40           5.1
41           5.0
42           4.5
43           4.4
44           5.0
45           5.1
46           4.8
47           5.1
48           4.6
49           5.3
50           5.0
51           7.0
52           6.4
53           6.9
54           5.5
55           6.5
56           5.7
57           6.3
58           4.9
59           6.6
60           5.2
61           5.0
62           5.9
63           6.0
64           6.1
65           5.6
66           6.7
67           5.6
68           5.8
69           6.2
70           5.6
71           5.9
72           6.1
73           6.3
74           6.1
75           6.4
76           6.6
77           6.8
78           6.7
79           6.0
80           5.7
81           5.5
82           5.5
83           5.8
84           6.0
85           5.4
86           6.0
87           6.7
88           6.3
89           5.6
90           5.5
91           5.5
92           6.1
93           5.8
94           5.0
95           5.6
96           5.7
97           5.7
98           6.2
99           5.1
100          5.7
101          6.3
102          5.8
103          7.1
104          6.3
105          6.5
106          7.6
107          4.9
108          7.3
109          6.7
110          7.2
111          6.5
112          6.4
113          6.8
114          5.7
115          5.8
116          6.4
117          6.5
118          7.7
119          7.7
120          6.0
121          6.9
122          5.6
123          7.7
124          6.3
125          6.7
126          7.2
127          6.2
128          6.1
129          6.4
130          7.2
131          7.4
132          7.9
133          6.4
134          6.3
135          6.1
136          7.7
137          6.3
138          6.4
139          6.0
140          6.9
141          6.7
142          6.9
143          5.8
144          6.8
145          6.7
146          6.7
147          6.3
148          6.5
149          6.2
150          5.9

Less often used, you can also do this with library:

library("dplyr", include.only = c("select", "mutate"))
mutate(iris, sepal_length = Sepal.Length * 10) |> 
  select(sepal_length)

The Comprehensive R Archive Network (CRAN)

  • Central repository for R packages
  • Rigorous policies and testing
  • Currently almost 20k packages (July 2023)

Other sources?

  • Rigorous policies and testing are also a downside
    • Developers hesitate to submit packages
    • Unmaintained (but functional) packages are removed from CRAN
  • Alternative repositories are common:
    • GitHub and Gitlab (and SVN)
    • Bioconductor, R-Forge and Omegahat
remotes::install_github("JBGruber/paperboy")

Help!

One of the most important commands in R is ?, which opens the help file for a function:

?install.packages # And
?remotes::install_github

All help files in R follow the same structure and principle (although not all help files contain all elements):

  • Title
  • Description
  • Usage: very important, as it shows you the default values for all arguments (i.e., what is used if you do not set anything) and their assumed order
install_github("JBGruber/paperboy") # Same as
install_github(repo = "JBGruber/paperboy",  ref = "HEAD") # Same as
install_github(ref = "HEAD", repo = "JBGruber/paperboy") # Not(!) same as
install_github("HEAD", "JBGruber/paperboy")
  • Arguments: description of the arguments of a function. One special argument is the ... (called ellipsis or dots), which is passed on to an underlying function.
install_github("JBGruber/paperboy", Ncpus = 6)
  • Details: Usually not that important but this is the first place to look when a function is not doing what you expect
  • Examples: where I usually start to learn a new function by looking at cases that certainly work (and then rewriting them for my purposes).

Help!

  • Google (“ggplot2 r remove legends”)
  • Some good resources for answers:
    • stackoverflow.com (if you want to ask a question instead see how to ask a good question and use a reproducible example)
    • R help list (stat.ethz.ch)
    • https://www.r-bloggers.com/ (collection of personal blog posts related to R – so quality varies)
  • ChatGPT
library(askgpt)
log_init()
mean[1:10]
askgpt("What is wrong with my last command?")

Functions

Functions are easy to define in R:

new_fun <- function(x = 1) {
  out <- c(
    sum(x),
    mean(x),
    median(x)
  )
  return(out)
}
new_fun()
[1] 1 1 1
vec <- c(1:10)
new_fun(x = vec)
[1] 55.0  5.5  5.5

Going through this bit by bit:

  • new_fun: The name of the new function (convention: use something descriptive; don’t use . or CamelCase but _ if you have multiple words)
  • <-: The assignment operator.
  • function(x): Define arguments and defaults here.
  • {}: Everything inside the curly brackets is the body of the function (the code that runs when you call the function).
  • return(): All objects created inside the function are destroyed once the function has finished running, except for what is put in return() (which can be implicit: without return(), the value of the last expression is returned).

Loops

For loops

Iterate over a vector:

x <- NULL
for (i in 1:10) {
  message(i)
  x <- c(x, i)
}
x
 [1]  1  2  3  4  5  6  7  8  9 10
  • for: This is how you start the loop
  • i: This is the variable which takes a different value in each iteration of the loop
  • in: separates the variable from the vector
  • 1:10: The vector over which to iterate
  • {}: The expression inside the curly brackets is evaluated once for each value in the vector; i takes a different value each run

Apply loops

Apply function to each element of a vector/list:

foo <- function(i, silent = FALSE) {
  if (!silent) {
    message(i) 
  }
  return(i)
}
x <- lapply(1:10, foo)
unlist(x)
 [1]  1  2  3  4  5  6  7  8  9 10

purrr::map loops

Also apply function to each element of a vector/list, but coerce types:

foo <- function(i, silent = FALSE) {
  if (!silent) {
    message(i) 
  }
  return(i)
}
x <- purrr::map_int(1:10, foo)
x
 [1]  1  2  3  4  5  6  7  8  9 10

if

if can be used to conditionally run code:

if (TRUE) {
  1 + 1
}
[1] 2
if (FALSE) {
  1 + 1
}

Any code that evaluates to a logical (TRUE/FALSE) can be used:

if (1 + 1 == 2) {
  "Hello!"
}
[1] "Hello!"

You can extend this with else, which is executed when the original condition is FALSE:

if (1 + 2 == 2) {
  "Hello!"
} else {
  "Bye"
}
[1] "Bye"

base R

Commonly people referring to base R mean all functions available when starting R but not loading any packages with library(package).

df <- mtcars # using a built-in example data.frame
table(df$cyl)

 4  6  8 
11  7 14 
sum(df$cyl)
[1] 198
mean(df$cyl)
[1] 6.1875
dist(head(df)) # calculates Euclidean distance between cases
                    Mazda RX4 Mazda RX4 Wag  Datsun 710 Hornet 4 Drive
Mazda RX4 Wag       0.6153251                                         
Datsun 710         54.9086059    54.8915169                           
Hornet 4 Drive     98.1125212    98.0958939 150.9935191               
Hornet Sportabout 210.3374396   210.3358546 265.0831615    121.0297564
Valiant            65.4717710    65.4392224 117.7547018     33.5508692
                  Hornet Sportabout
Mazda RX4 Wag                      
Datsun 710                         
Hornet 4 Drive                     
Hornet Sportabout                  
Valiant                 152.1241352
tolower(row.names(df))
 [1] "mazda rx4"           "mazda rx4 wag"       "datsun 710"         
 [4] "hornet 4 drive"      "hornet sportabout"   "valiant"            
 [7] "duster 360"          "merc 240d"           "merc 230"           
[10] "merc 280"            "merc 280c"           "merc 450se"         
[13] "merc 450sl"          "merc 450slc"         "cadillac fleetwood" 
[16] "lincoln continental" "chrysler imperial"   "fiat 128"           
[19] "honda civic"         "toyota corolla"      "toyota corona"      
[22] "dodge challenger"    "amc javelin"         "camaro z28"         
[25] "pontiac firebird"    "fiat x1-9"           "porsche 914-2"      
[28] "lotus europa"        "ford pantera l"      "ferrari dino"       
[31] "maserati bora"       "volvo 142e"         

Especially for simple operations and statistics, base is still great.

model <- lm(hp ~ mpg, data = df) # simple linear regression
summary(model)

Call:
lm(formula = hp ~ mpg, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-59.26 -28.93 -13.45  25.65 143.36 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   324.08      27.43  11.813 8.25e-13 ***
mpg            -8.83       1.31  -6.742 1.79e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 43.95 on 30 degrees of freedom
Multiple R-squared:  0.6024,    Adjusted R-squared:  0.5892 
F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

base R

base also has a plotting system:

plot(df$mpg, df$hp, col = "blue", ylab = "horse power", xlab = "miles per gallon", main = "Simple linear regression")
abline(model, col = "red")
text(30, 300, "We can add some text", col = "red")

Tidyverse

What is it?

  • The official description: “The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures”.
  • The principle that gives the tidyverse its name is that of tidy data: “Each variable forms a column. Each observation forms a row.” (see tidyr vignette for more info)
  • Seems trivial at first but as a principle can be quite consequential (e.g., it means that most object types are ignored and data.frames are very dominant)
  • Some coding principles attached to it (e.g., the pipe, functions as verbs that build on each other)

The pipe

  • Formerly %>%, now native in R as |>
  • Forwards the result of one function to another
  • Makes for much more readable code:
transform(aggregate(. ~ cyl, data = subset(mtcars, hp > 100), FUN = function(x) round(mean(x, 2))), kpl = mpg * 0.4251)
  cyl mpg disp  hp drat wt qsec vs am gear carb     kpl
1   4  26  108 111    4  2   18  1  1    4    2 11.0526
2   6  20  168 110    4  3   18  1  0    4    4  8.5020
3   8  15  350 192    3  4   17  0  0    3    4  6.3765

You can make this more readable by creating intermediate objects:

data1 <- subset(mtcars, hp > 100) # take subset of original data
data2 <- aggregate(. ~ cyl, data = data1, FUN = function(x) round(mean(x, 2))) # aggregate by taking rounded mean
transform(data2, kpl = mpg * 0.4251) # convert miles per gallon to kilometer per liter
  cyl mpg disp  hp drat wt qsec vs am gear carb     kpl
1   4  26  108 111    4  2   18  1  1    4    2 11.0526
2   6  20  168 110    4  3   18  1  0    4    4  8.5020
3   8  15  350 192    3  4   17  0  0    3    4  6.3765

Or you use the pipe:

subset(mtcars, hp > 100) |> 
  aggregate(. ~ cyl, data = _, FUN = function(x) round(mean(x, 2))) |> 
  transform(kpl = mpg * 0.4251)
  cyl mpg disp  hp drat wt qsec vs am gear carb     kpl
1   4  26  108 111    4  2   18  1  1    4    2 11.0526
2   6  20  168 110    4  3   18  1  0    4    4  8.5020
3   8  15  350 192    3  4   17  0  0    3    4  6.3765

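tidyverse functions are written with pipes in mind and are named as verbs with the goal to tell you exactly what they do. (The numbers below differ from the base R versions above: round(mean(x, 2)) passes 2 to the trim argument of mean(), which then returns the median, and rounds to whole numbers, whereas the code below rounds the mean to two digits.)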

library(tidyverse)
mtcars |> 
  filter(hp > 100) |> 
  group_by(cyl) |> 
  summarise(across(.cols = everything(), .fns = function(x) x |> mean() |> round(2))) |> 
  mutate(kpl = mpg * 0.4251)
# A tibble: 3 × 12
    cyl   mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb   kpl
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     4  25.9  108.  111   3.94  2.15  17.8  1     1     4.5   2    11.0 
2     6  19.7  183.  122.  3.59  3.12  18.0  0.57  0.43  3.86  3.43  8.39
3     8  15.1  353.  209.  3.23  4     16.8  0     0.14  3.29  3.5   6.42

Note: You can interject the View() command at any line in a complicated pipeline to see the intermediate result in a spreadsheet-style data viewer.

Special package ggplot2

  • Completely overhauls the plotting system in R
  • IMO: the best plotting system in any programming/data science language
  • Implements the “Grammar of Graphics”: a language for describing custom plots instead of relying on predefined plotting functions
  • The specific logic makes it harder to learn than other packages, but you can express essentially any plot in it (I highly recommend using “ggplot2: Elegant Graphics for Data Analysis” to learn the package instead of individual tutorials)

Exercises 3

  1. Run ggplot(data = mpg). What do you see and why?
  2. In the function pb_collect() from paperboy, what do the arguments ignore_fails and connections do?
  3. Write a function that takes a numeric vector of miles per gallon consumption data and transforms it to kilometer per liter. If anything other than a numeric vector is entered, the function should display an error (hint: see ?stop).
  4. In the code below, check the sizes of the intermediate objects with object.size().
file_link <- "https://raw.githubusercontent.com/shawn-y-sun/Customer_Analytics_Retail/main/purchase%20data.csv"
df <- read.csv(file_link)
filtered_df <- df[df$Age >= 50,]
aggregated_df <- aggregate(filtered_df$Quantity, by = list(filtered_df$Day), FUN = sum)
names(aggregated_df) <- c("day", "total_quantity")
aggregated_df[order(aggregated_df$total_quantity, decreasing = TRUE)[1:5],]
    day total_quantity
162 162             73
460 460             73
123 123             61
183 183             60
340 340             57
  5. How could the code above be improved if you only want the final result, want the code to be readable, and care about memory usage?

Literate Programming

Background

“The language in which we express our ideas has a strong influence on our thought processes.”

― Donald Ervin Knuth, Literate Programming

  • When analysing data in R, a cornerstone of a good workflow is documenting what you are doing.
  • The whole point of doing data analysis in a programming language rather than a point and click tool is reproducibility.
  • Yet if your code does not run after a while and you don’t understand what you were doing when writing the code, it’s as if you had done your whole analysis in Excel!

Advantages

This is where literate programming has a lot of advantages:

  1. Enhanced Documentation: Literate programming combines code and documentation in a single, integrated document. This approach encourages researchers to write clear and comprehensive explanations of their code, making it easier for others (and even themselves) to understand the working of the code, (research) design choices, and logic.
  2. Improved Readability: By structuring code and documentation in a literate programming style, the resulting code becomes more readable and coherent. The narrative flow helps readers follow the thought process and intentions of the programmer, leading to improved comprehension and maintainability.
  3. Modular and Reusable Code: Literate programming emphasizes the organization of code into coherent and reusable chunks, as writers come to think of them as similar to paragraphs in a text, where each chunk develops one specific idea.
  4. Collaboration and Communication: Literate programming enhances collaboration among developers by providing a common platform to discuss, share, and review code. The narrative style fosters effective communication, allowing team members to understand the codebase more easily and collaborate more efficiently.
  5. Extensibility and Maintenance: Well-documented literate programs are typically easier to extend and maintain over time. The clear explanation of choices and functionality helps yourself and others in the future to make decisions about modifications, enhancements, and bug fixes.
  6. Reproducibility and accountability: when you save the rendered output of an analysis, you know exactly how a table or plot was created. If there are several versions, you can always turn to the rendered document and check which data, code and package versions were used to do your analysis (at least when documents were written in a specific way).

Quarto (and its predecessor R Markdown) were designed to make it easy for you to make the most of these advantages. We have already been using these tools throughout the workshop and I hope this made you more familiar with them.

Exercises 4

  1. Use the function report_template() from my package jbgtemplates to start a new report

  2. Add some simple analysis in it and render

  3. Create a new quarto document and use the following yaml header to start your research abstract:

---
title: "Your Research Title"
subtitle: "Abstract Introduction to Web Scraping and Data Management for Social Scientists"
author: Your Name
date: today
format: pdf
---

How to use R and Python together

Why combine Python with R?

Why not just switch to Python?

  1. If you’re here, you probably already know R so why re-learn things from scratch?
  2. R is a programming language specifically for statistics with some great built-in functionality that you would miss in Python.
  3. R has absolutely outstanding packages for data science with no drop-in replacement in Python (e.g., ggplot2, dplyr, tidytext).

Why not just stick with R then?

  1. Many computational tools are not natively available in R (e.g., for browser emulation and machine learning) as advancements are made by software engineers and companies who rely on Python
  2. You might want to collaborate with someone who uses Python and need to run their code
  3. Learning a new (programming) language is always good to extend your skills (also in the language(s) you already know)

Setting up Python

(Try to) Find Python

Before you load reticulate for the first time, we need to create a virtual environment (and potentially install a version of Python). This is a folder in your project directory with a link to Python and the packages you want to use in this project. Why?

  • Packages (or their dependencies) on the Python Package Index can be incompatible with each other – meaning you can break things by updating.
  • Your operating system might keep older versions of some packages around, which means you could break your OS with an accidental update!
  • This also adds to projects being reproducible on other systems, as you keep track of the specific version of each package used in your project (you could do this in R with the renv package).

The first step is to check if Python is already available and to find where it is located on your system:

if (R.Version()$os == "mingw32") {
  system("where python") # for Windows
} else {
  system("whereis python")
}
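
The same check can also be run directly in the terminal. The sketch below is an alternative to the R code above, not part of the original slides; the helper name find_python is made up for this example.

```shell
# Look for a Python executable on the PATH; python3 is tried first because
# some systems do not provide a plain "python" command
find_python() {
  for candidate in python3 python; do
    if command -v "$candidate" >/dev/null 2>&1; then
      command -v "$candidate"    # print the full path of the executable
      return 0
    fi
  done
  echo "no Python found on PATH"
}

find_python
```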

(If Nothing is Found) Install Python

The easiest way to install Python for R projects is through reticulate (it also regularly causes issues though, so consider using your package manager):

reticulate::install_miniconda()

Note, however, that your user name cannot contain spaces or special characters. If it does, you should install miniconda in a different location than the default, for example with reticulate::install_miniconda(path = "C:/tools/miniconda") (you need to create the folder C:/tools manually). Also note that system("whereis python") will not pick up this installation. Instead you can find the path using:

reticulate::miniconda_path()
[1] "/home/johannes/.local/share/r-miniconda"

Create a Virtual Environment

To do this, you first have to indicate the location where your Python executable lives (this path should always end in /bin/python or python.exe on Windows):

python_location <- "/home/johannes/.local/share/r-miniconda/bin/python"
# for Windows the path is usually "C:/Users/{user}/AppData/Local/r-miniconda/python.exe"

Then we can create a new virtual environment in the project folder:

# I built in this if condition to not accidentally overwrite the environment when rerunning the notebook
if (!reticulate::virtualenv_exists(envname = "../python-env/")) {
  reticulate::virtualenv_create("../python-env/", python = python_location)
}
reticulate::virtualenv_exists(envname = "../python-env/")
[1] TRUE

Make sure the Right Environment is Loaded

if (R.Version()$os == "mingw32") {
  python_path <- "../python-env/Scripts/python.exe"
} else {
  python_path <- "../python-env/bin/python"
}
python_path
[1] "../python-env/bin/python"
file.exists(python_path)
[1] TRUE
Sys.setenv(RETICULATE_PYTHON = python_path)

We can write this to your .Renviron file (otherwise the Sys.setenv() line above needs to be in every script). Note: the variables in the .Renviron file are set when R is started.

usethis::edit_r_environ(scope = "project")

The file should look something like this:

RETICULATE_PYTHON=/home/johannes/Dropbox/Teaching/ess-introduction-to-web-scraping/python-env/bin/python

Load reticulate and See if it is Working

library(reticulate)
py_config()
python:         /home/johannes/Dropbox/Teaching/ess-introduction-to-web-scraping/python-env/bin/python
libpython:      /usr/lib/libpython3.11.so
pythonhome:     /home/johannes/Dropbox/Teaching/ess-introduction-to-web-scraping/python-env:/home/johannes/Dropbox/Teaching/ess-introduction-to-web-scraping/python-env
version:        3.11.3 (main, Jun  5 2023, 09:32:32) [GCC 13.1.1 20230429]
numpy:          /home/johannes/Dropbox/Teaching/ess-introduction-to-web-scraping/python-env/lib/python3.11/site-packages/numpy
numpy_version:  1.25.1

NOTE: Python version was forced by RETICULATE_PYTHON

Installing Python Packages

reticulate::py_install() installs packages, similar to install.packages(). Let’s install the packages we need:

reticulate::py_install(c(
  "playwright",
  "xvfbwrapper"
))

But there are some caveats:

  • not all packages can be installed with the name you see in scripts (e.g., to install the package, you call “scikit-learn”, but to load it you need sklearn)
  • you might need a specific version of a package to follow a specific tutorial
  • there can be different flavours of the same package (e.g., bertopic, bertopic[gensim], bertopic[spacy])
  • you will get a cryptic error if you attempt to install base Python packages
reticulate::py_install("os")
Using virtual environment '/home/johannes/Dropbox/Teaching/ess-introduction-to-web-scraping/python-env' ...
Error: Error installing package(s): "'os'"

Installing Python Packages

If you see the $ in the beginning, these are command line/bash commands. Use the ```{bash} chunk option to run these commands and use the pip and python versions in your virtual environment (you could also activate the environment instead).

```{bash}
#| eval: false
./python-env/bin/pip install -U pip setuptools wheel
./python-env/bin/pip install -U 'spacy'
./python-env/bin/python -m spacy download en_core_web_sm
./python-env/bin/python -m spacy download de_core_news_sm
```

On Windows, the binary files are in a different location:

```{bash}
#| eval: false
./python-env/Scripts/pip.exe install -U pip setuptools wheel
./python-env/Scripts/pip.exe install -U 'spacy'
./python-env/Scripts/python.exe -m spacy download en_core_web_sm
./python-env/Scripts/python.exe -m spacy download de_core_news_sm
```
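
If you want one script that works on both kinds of systems, the two locations can be wrapped in a small helper. This is a sketch under assumptions: venv_executable() is a hypothetical function written for this slide, and it only covers the standard bin/ and Scripts/ layouts shown above.

```python
import os
from pathlib import Path

def venv_executable(env_dir, name, windows=os.name == "nt"):
    """Return the path to an executable inside a virtual environment."""
    env = Path(env_dir)
    if windows:
        # Windows keeps the binaries in Scripts/ and adds .exe
        return env / "Scripts" / f"{name}.exe"
    # Linux and macOS keep them in bin/
    return env / "bin" / name

print(venv_executable("python-env", "pip"))
print(venv_executable("python-env", "python"))
```

The windows argument defaults to the system you are running on, but you can set it by hand to see the path for the other platform.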

General tip: check whether the software distributor provides installation instructions, like the excellent ones from spaCy.

Workflow

Use in Quarto

In my opinion, a nice workflow is to use R and Python together in a Quarto document. All you need to do to tell Quarto to run a Python chunk instead of an R chunk is to replace ```{r} with ```{python}.

```{r}
text <- "Hello World! From R"
print(text)
```
[1] "Hello World! From R"
```{python}
text = "Hello World! From Python"
print(text)
```
Hello World! From Python

Shortcut

You can even set up a shortcut to make these chunks (I like Ctrl+Alt+P).

To get an interactive Python session in your Console, you can use reticulate::repl_python().

As you’ve seen above, the code is pretty similar, with a few key differences:

Syntax

  • = instead of <- for assignment
  • code formatting (indentation) is part of the syntax!
  • base Python does not have a data.frame class; instead you have dictionaries or the DataFrame from the pandas package
  • Python lists are the closest equivalent of R vectors (although, like R lists, they can hold mixed types)
  • the *apply family of functions and vectorised code do not exist as such in base Python: (almost) everything is a for loop!
  • a lot of packages are written in an object-oriented rather than a functional style
  • many more!
my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
my_list + 2 # does not work in Python
can only concatenate list (not "int") to list
for i in my_list:
    print(i + 2)
3
4
5
6
7
8
9
10
11
12
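
The loop above can also be written more compactly. A list comprehension is probably the closest base-Python counterpart to R's vectorised my_list + 2, and map() plays a role similar to sapply(). A minimal sketch, reusing my_list; plus_two and plus_two_map are names introduced here for illustration:

```python
my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# list comprehension: build a new list by evaluating i + 2 for every element
plus_two = [i + 2 for i in my_list]
print(plus_two)

# map() applies a function to every element, roughly like sapply() in R
plus_two_map = list(map(lambda i: i + 2, my_list))
print(plus_two_map == plus_two)
```

Unlike the for loop with print(), both versions keep the results in a new list that you can continue to work with.
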
my_dict = {'name': ['John', 'Jane', 'Jim', 'Joan'],
          'age': [32, 28, 40, 35],
          'city': ['New York', 'London', 'Paris', 'Berlin']}
my_dict
{'name': ['John', 'Jane', 'Jim', 'Joan'], 'age': [32, 28, 40, 35], 'city': ['New York', 'London', 'Paris', 'Berlin']}
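
A dictionary of equally long lists can stand in for a small data.frame in base Python, although you have to do the bookkeeping yourself. A minimal sketch, reusing my_dict; ages, mean_age and rows are names introduced here for illustration:

```python
my_dict = {'name': ['John', 'Jane', 'Jim', 'Joan'],
           'age': [32, 28, 40, 35],
           'city': ['New York', 'London', 'Paris', 'Berlin']}

# access one "column" by its key, similar to my_df$age in R
ages = my_dict['age']
mean_age = sum(ages) / len(ages)
print(mean_age)

# zip() lines the columns up so that you can walk through the "rows"
rows = list(zip(my_dict['name'], my_dict['age'], my_dict['city']))
for name, age, city in rows:
    print(f"{name} ({age}) lives in {city}")
```

Nothing stops the lists from having different lengths, which is one reason why most people switch to a pandas DataFrame for real data.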

reticulate Magic

The truly magical thing about reticulate is how seamlessly it hands objects back and forth between Python and R:

py$text
[1] "Hello World! From Python"
py$my_list
 [1]  1  2  3  4  5  6  7  8  9 10
py$my_dict
$name
[1] "John" "Jane" "Jim"  "Joan"

$age
[1] 32 28 40 35

$city
[1] "New York" "London"   "Paris"    "Berlin"  
my_df <- data.frame(num = 1:10,
                    let = LETTERS[1:10])
my_list <- list(df = my_df, 11:20)
r.text
'Hello World! From R'
r.my_df
{'num': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'let': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']}
r.my_list
{'df': {'num': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'let': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']}, '': [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]}

Functions

What I think is especially cool is that this even works with functions:

def hello(x=None):
  """
  :param x: name of the person to say hello to.
  """
  if not x:
    print("Hello World!")
  else:
    print("Hello " + x + "!")
py$hello()
py$hello("Class")
reticulate::py_help(py$hello)
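
The help text comes from the docstring, the string in triple quotes right below the def line, which you can also read from within Python. A minimal sketch with one deliberate change from the slide: this variant returns the greeting instead of printing it, so the result can be stored in an object.

```python
import inspect

def hello(x=None):
    """
    :param x: name of the person to say hello to.
    """
    if not x:
        return "Hello World!"
    return "Hello " + x + "!"

print(hello())
print(hello("Class"))

# inspect.getdoc() returns the cleaned-up docstring of the function
print(inspect.getdoc(hello))
```

Returning instead of printing is a design choice rather than a requirement: a returned value can be assigned in both languages, while print() only writes to the console.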

Exercises 5

  1. Write a Python function that takes a numeric vector of miles per gallon consumption data and transforms it to kilometres per litre. Use the function from within R.
  2. Use the function in Python, but with a vector defined in R.

Homework

You did not come to class just to scrape exercise pages. You probably had some initial data and/or a research question in mind. Please write a short abstract (~200-400 words) on what you want to accomplish with the web scraping skills you will learn here, so we can try to incorporate the necessary tools in one of the sessions this week. The abstract should include what data can be found on the website and what potential research questions you have in mind.

Deadline: Tuesday midnight

Wrap Up

Save some information about the session for reproducibility.

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: EndeavourOS

Matrix products: default
BLAS:   /usr/lib/libblas.so.3.11.0 
LAPACK: /usr/lib/liblapack.so.3.11.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=nl_NL.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=nl_NL.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=nl_NL.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Amsterdam
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] reticulate_1.30 lubridate_1.9.2 forcats_1.0.0   stringr_1.5.0  
 [5] purrr_1.0.1     readr_2.1.4     tidyr_1.3.0     tibble_3.2.1   
 [9] ggplot2_3.4.2   tidyverse_2.0.0 dplyr_1.1.2    

loaded via a namespace (and not attached):
 [1] Matrix_1.6-0      gtable_0.3.3      jsonlite_1.8.7    compiler_4.3.1   
 [5] Rcpp_1.0.11       tidyselect_1.2.0  png_0.1-8         scales_1.2.1     
 [9] yaml_2.3.7        fastmap_1.1.1     lattice_0.21-8    R6_2.5.1         
[13] generics_0.1.3    knitr_1.43        munsell_0.5.0     pillar_1.9.0     
[17] tzdb_0.4.0        rlang_1.1.1       utf8_1.2.3        stringi_1.7.12   
[21] xfun_0.39         timechange_0.2.0  cli_3.6.1         withr_2.5.0      
[25] magrittr_2.0.3    digest_0.6.33     grid_4.3.1        rstudioapi_0.15.0
[29] rappdirs_0.3.3    hms_1.1.3         lifecycle_1.0.3   vctrs_0.6.3      
[33] evaluate_0.21     glue_1.6.2        codetools_0.2-19  fansi_1.0.4      
[37] colorspace_2.1-0  rmarkdown_2.23    tools_4.3.1       pkgconfig_2.0.3  
[41] htmltools_0.5.5